This lesson will guide students through the process of merging two messy dataframes by a common key using merge.
Prerequisites
Teaching Objectives
By the end of this lecture, students should be able to:
Roadmap
Scenario
!!!
Our gathering of famous duos has been beset by data goblins, resulting in all of our duos being separated! They have gathered together in two locations (dataframes), the california and massachusetts. Every duo wants to find their way back home, and to reunite with their opposite.
Practically, we want to use the data provided to produce a single dataframe, that has one row for each duo pair, one column for each member, their associated color, and hometown. And example of our desired output can be seen here.
Desired Output Example
Please run the following code to download our class data, and prepare it for the lesson.
# Load in student survey csv
.raw_survey = read.csv("https://docs.google.com/spreadsheets/d/1E8ME8ldv8Pv4-saL-ZYHJckmVPNgcNd53LasYiW0jwU/export?format=csv&usp=sharing", header = TRUE, stringsAsFactors = FALSE)
# Let loose the data gremlins
# Don't look at this file until after lecture! Here there be spoilers.
source("https://raw.githubusercontent.com/Epsian/table_join_lecture/main/src/1_split_duos.R")
Our duos were split, and individuals were cast off into to different locations (dataframes), named california and massachusetts. Let’s start by looking at them. Each dataframe has 4 columns: the name of the individual, a color, the name of their alter (pairing), and their hometown.
Starting with the california:
head(california, 10)
## name color alter hometown
## 1 Sherlock Blue <NA> LONDON
## 2 Batman Red Robin GOTHAM
## 3 Mandy Red <NA> EVILLE
## 4 Aang Orange Katara REPUBLIC CITY
## 5 Link Green <NA> CASTLE TOWN
## 6 Han Solo Blue <NA> CORELLIA
## 7 Kiki Red <NA> KORIKO
## 8 Vash the Stampede Red Nicholas D. Wolfwood A DESERT PLANET
## 9 Konrad Franco Red Konstanze Conradt SANTA ROSA
## 10 Scooby Doo Orange Shaggy COOLSVILLE
str(california)
## 'data.frame': 11 obs. of 4 variables:
## $ name : chr "Sherlock" "Batman" "Mandy" "Aang" ...
## $ color : chr "Blue" "Red" "Red" "Orange" ...
## $ alter : chr NA "Robin" NA "Katara" ...
## $ hometown: chr "LONDON" "GOTHAM" "EVILLE" "REPUBLIC CITY" ...
And then the massachusetts:
head(massachusetts, 10)
## name color alter hometown
## 1 Nicholas D. Wolfwood Blue Vash the Stampede A_Desert_Planet
## 2 Watson Indigo Sherlock London_(13)
## 3 JiJi Indigo <NA> Koriko
## 4 Billy Blue <NA> Eville
## 5 Shaggy Green Scooby Doo Coolsville
## 6 Chewbacca Orange <NA> Corellia
## 7 Harry Potter Red Hermione Granger HOGWARTS
## 8 Konstanze Conradt Orange Konrad Franco Santa_Rosa
## 9 Hermione Granger Yellow <NA> Hogwarts
## 10 Robin Yellow Batman Gotham
str(massachusetts)
## 'data.frame': 15 obs. of 4 variables:
## $ name : chr "Nicholas D. Wolfwood" "Watson" "JiJi" "Billy" ...
## $ color : chr "Blue" "Indigo" "Indigo" "Blue" ...
## $ alter : chr "Vash the Stampede" "Sherlock" NA NA ...
## $ hometown: chr "A_Desert_Planet" "London_(13)" "Koriko" "Eville" ...
Checkpoint
Our eventual goal is to join the two dataframes above, and have a single row for each duo pair, one column for each member, and one column for their hometown. How can we get there? Let’s look at a few possibilities.
If you would like to test these dataframes within R yourself, you can run this code. The results will be identical to the figures.
source("https://raw.githubusercontent.com/Epsian/table_join_lecture/main/src/2_merge_examples.R")
rbindIf you have two dataframes and want to combine them, you need to decide what direction you want to combine them in. If you would like to combine them vertically, or stack the rows on top of each other, you can use rbind or row bind. rbind is helpful if you have two dataframes with the same columns, and you want to combine the cases (rows). However, it will not work if the columns are different. rbind accepts an arbitrary number of dataframes, here we use two: upper_rbind_df and lower_rbind_df.
outcome = rbind(upper_rbind_df, lower_rbind_df)
rbind example 1
cbindIf you would like to combine two dataframes horizontally, so that you add more columns on to a dataframe, you can use cbind or column bind. The cbind function takes an arbitrary number of dataframes as it’s arguments. We’ll provide just our two here, but try repeating one multiple times if you would like to see what happens.
outcome = cbind(left_cbind_df, right_cbind_df)
cbind example 1
We can see out new outcome dataframe was created as expected. This method can work, but you must be very confident about the structure of your data. For example, in the following figure, what issues do you expect to run into with this method?
cbind example 2
mergecbind may work in some situations, but is very “brittle” – or easy to break – with any changes to the data. A more reliable method of merging makes use of a common key between two sources of data. The key can be anything, like a numerical ID or a string, but they must always be unique. Joins that use the key method will be familiar to anyone who has used SQL in the past, as they follow the same naming convention as in that language.
An inner join tries to join two sets of data using a common key, keeping only the data that exists in both of the original data sets. This is commonly illustrated using a venn diagram similar to the one below. Only the area highlighted with green will be included in the output.
Inner Join - Venn Diagram
In the context of our data, it might look something like the following if we use hometown as our key. We can use the following code to do an inner merge using the merge function. The merge function in R requires two arguments, x and y, which are the dataframes we would like to merge. We will be using left_merge_df and right_merge_df, and asking merge to use hometown as the key to merge by.
outcome = merge(x = left_merge_df, y = right_merge_df, by = "hometown")
Inner Join - Example Data
Everything seems in order. We can see that when using an inner join, cases where both dataframes have a single row with our key will be matched and joined into a single output dataframe. However, if there are rows in either data frame without a match, those rows will be dropped from our data.
An Outer Join is the opposite of an inner join. Rather than just looking for those rows which have a key in common, it will join every row, regardless of the keys, inserting blank values where there is no match. You can get a sense of this merge with the venn diagram below, where the green indicates good matches, while the orange indicates partial matches.
Outer Join - Venn Diagram
In the context of our data, an outer join may look something like the following, again using hometown as our key. The code for an outer join is similar to an inner join, with one addition. We will want to specify that all = TRUE, essentially saying we want everything included.
outcome = merge(x = left_merge_df, y = right_merge_df, by = “hometown”, all = TRUE)
Outer Join - Example Data
In this example, we see that an outer join will keep all available cases, but we introduce NA values into the resulting dataframe where no match could be made. This may or may not be an issue depending on what your next steps are.
A left join allows you to pick one of the two dataframes you are joining and prioritize it. It essentially takes all of the cases in the dataframe on the ‘left’ side, and searches in ‘right’ dataframe to join what it can. Keeping with the venn diagram representations, it would look like the following:
Left Join - Venn Diagram
Using our example data, a left join would look like this. The R code is similar, to the previous examples, but we specify that instead of all = TRUE, we just want all.x = TRUE. Recall that in the function call, x is the first dataframe, or the “left” one.
outcome = merge(x = left_merge_df, y = right_merge_df, by = “hometown”, all.x = TRUE)
Left Join - Example Data
We can see that while all of the data from our ‘left’ side is preserved, unmatched data on the ‘right’ is discarded. You can also technically do a right join, which will do the same thing with sides reversed, but moving your prioritized data set to the left is more common.
Checkpoint
Let’s clean up our environment a bit using the following code:
rm(e1_cbind, e2_cbind, e3_cbind, e1_key, e2_key, e3_inner, e3_outer, e3_left)
Now that we have a few possible tools, we can start considering how we are going to join our duos back together. Let’s glance back at our california and massachusetts dataframes to get an idea of what we’re working with.
head(california, 10)
## name color alter hometown
## 1 Sherlock Blue <NA> LONDON
## 2 Batman Red Robin GOTHAM
## 3 Mandy Red <NA> EVILLE
## 4 Aang Orange Katara REPUBLIC CITY
## 5 Link Green <NA> CASTLE TOWN
## 6 Han Solo Blue <NA> CORELLIA
## 7 Kiki Red <NA> KORIKO
## 8 Vash the Stampede Red Nicholas D. Wolfwood A DESERT PLANET
## 9 Konrad Franco Red Konstanze Conradt SANTA ROSA
## 10 Scooby Doo Orange Shaggy COOLSVILLE
head(massachusetts, 10)
## name color alter hometown
## 1 Nicholas D. Wolfwood Blue Vash the Stampede A_Desert_Planet
## 2 Watson Indigo Sherlock London_(13)
## 3 JiJi Indigo <NA> Koriko
## 4 Billy Blue <NA> Eville
## 5 Shaggy Green Scooby Doo Coolsville
## 6 Chewbacca Orange <NA> Corellia
## 7 Harry Potter Red Hermione Granger HOGWARTS
## 8 Konstanze Conradt Orange Konrad Franco Santa_Rosa
## 9 Hermione Granger Yellow <NA> Hogwarts
## 10 Robin Yellow Batman Gotham
In order to join these dataframes and reunite our duos, we need to choose some method to match on. For this activity, we will be using one of the key based methods we covered above. However, data is rarely in a clean and usable state from the start. Often the bulk of your time will be spent preparing the data to do the actual task you want to accomplish.
Checkpoint
Spoiler Space
Don’t scroll past here until you’ve answered the above questions!
For our example, we will be using the hometown as out matching key; we ultimately want all of our duos to get home! However, the data gremlins have made this task more difficult, as the hometowns have various noise in them that prevents us from using them as a key. We’re going to have to repair the damage before we can reunite our duos.
We need to make our hometowns exact between out duos to use them as a key. To do this, we will need to standardize the capitalization, and the space deliminators in the hometown column. Let’s take care of the capitalization first.
While we could spend time making sure all of our hometowns are properly capitalized, it isn’t really necessary in this application; we just need them to match. An easy way to do that is to set everything to upper or lower case. Let’s use lower case here, making use of the tolower function in R. It will convert any character vector into all lower case.
First for the california:
california$hometown = tolower(california$hometown)
And then the massachusetts:
massachusetts$hometown = tolower(massachusetts$hometown)
Let’s pause to look at our work …
head(california, 10)
## name color alter hometown
## 1 Sherlock Blue <NA> london
## 2 Batman Red Robin gotham
## 3 Mandy Red <NA> eville
## 4 Aang Orange Katara republic city
## 5 Link Green <NA> castle town
## 6 Han Solo Blue <NA> corellia
## 7 Kiki Red <NA> koriko
## 8 Vash the Stampede Red Nicholas D. Wolfwood a desert planet
## 9 Konrad Franco Red Konstanze Conradt santa rosa
## 10 Scooby Doo Orange Shaggy coolsville
Looks good. Now let’s deal with the spacing. If you look at our dataframes and compare across them, you’ll notice some hometowns have underscores in them. This seems to only happen when there would be a space. We can use that rule to replace all the underscores with a space instead. To do this, we’ll use a gsub function, or “global substitution.” It will look over all of the strings, and globally replace a pattern we specify with a substitution. In this case, we want to replace all "_" with " ".
# gsub(Pattern, replacement, X)
# pattern is what we want to replace
# replacement is what we want to replace the pattern
# X is the vector of character strings we want to run this replacement on
california$hometown = gsub(pattern = "_", replacement = " ", x = california$hometown)
And now the massachusetts:
massachusetts$hometown = gsub(pattern = "_", replacement = " ", x = massachusetts$hometown)
Let’s check our work …
head(california, 10)
## name color alter hometown
## 1 Sherlock Blue <NA> london
## 2 Batman Red Robin gotham
## 3 Mandy Red <NA> eville
## 4 Aang Orange Katara republic city
## 5 Link Green <NA> castle town
## 6 Han Solo Blue <NA> corellia
## 7 Kiki Red <NA> koriko
## 8 Vash the Stampede Red Nicholas D. Wolfwood a desert planet
## 9 Konrad Franco Red Konstanze Conradt santa rosa
## 10 Scooby Doo Orange Shaggy coolsville
Looking good! It seems we have a consistent key. Now for the ultimate task of reuniting our duos.
Unfortunately, our duos were not split evenly. Some duos both got sent to the same place, while others were sent to different places. As with many data science problems, there are a number of ways to overcome this, and no “best” answer.
Checkpoint
Spoiler Space
Don’t scroll past here until you’ve answered the above questions!
I’ll walk you through one method, but stress that this is not an “optimal” answer, it is just one of many. For my solution, I will aggregate all of the individuals into one dataframe, and then split them so that each new dataframe has only one instance of each hometown.
First, I’ll combine all my individuals. You can add on rows to a dataframe using rbind much the same way you can add on columns using cbind in R. However, rbind is typically less fraught, as rows are typically conceptualized as individual and independent of each other, unlike columns which typically show data about rows. rbind essentially stacks two dataframes on top of each other, or appends the rows of one dataframe to another.
# combine all out individuals into one dataframe
individuals = rbind(massachusetts, california)
Now all of our individuals are in a single dataframe. Our next task is to split them evenly, so one member of each duo, by hometown, is in a separate dataframe. For this we can make use of the duplicated function. duplicated checks if a specific value has appeared already within a vector. For example, if you had a vector of c("A", "B", "B", "C", "D", "C"), the second B and second C would be considered duplicated, and thus output TRUE while all other values would output FALSE.
In our case, if we run duplicated over the hometown column for our individuals, and split the TRUE into one datafarme and FALSE into another, we should end up with two dataframes, each containing one individual from a hometown. Let’s give it a try. First, let’s see what it looks like when we run duplicated. We’ll assign it to our individuals dataframe so it is easier to see.
# run duplicated over hometowns and assign as a new column in our individuals dataframe
individuals$duplicated = duplicated(individuals$hometown)
# show the dataframe
individuals
## name color alter hometown duplicated
## 1 Nicholas D. Wolfwood Blue Vash the Stampede a desert planet FALSE
## 2 Watson Indigo Sherlock london (13) FALSE
## 3 JiJi Indigo <NA> koriko FALSE
## 4 Billy Blue <NA> eville FALSE
## 5 Shaggy Green Scooby Doo coolsville FALSE
## 6 Chewbacca Orange <NA> corellia FALSE
## 7 Harry Potter Red Hermione Granger hogwarts FALSE
## 8 Konstanze Conradt Orange Konrad Franco santa rosa FALSE
## 9 Hermione Granger Yellow <NA> hogwarts TRUE
## 10 Robin Yellow Batman gotham FALSE
## 11 Katara Blue Aang republic city FALSE
## 12 Watson Indigo <NA> london FALSE
## 13 Zelda Violet <NA> castle town FALSE
## 14 Lorelai Gilmore Orange Rory Gilmore stars hollow FALSE
## 15 Rory Gilmore Yellow <NA> stars hollow TRUE
## 16 Sherlock Blue <NA> london TRUE
## 17 Batman Red Robin gotham TRUE
## 18 Mandy Red <NA> eville TRUE
## 19 Aang Orange Katara republic city TRUE
## 20 Link Green <NA> castle town TRUE
## 21 Han Solo Blue <NA> corellia TRUE
## 22 Kiki Red <NA> koriko TRUE
## 23 Vash the Stampede Red Nicholas D. Wolfwood a desert planet TRUE
## 24 Konrad Franco Red Konstanze Conradt santa rosa TRUE
## 25 Scooby Doo Orange Shaggy coolsville TRUE
## 26 Sherlock Blue Watson london (13) TRUE
Looks good! Now we can use that new column to evenly split our dataframe, and them cleanly pair our duos back together! First let’s evenly split them:
# assign to a new duo1 dataframe, the contents of dataframe "individuals"
# SUCH THAT
# the rows whose value in column `individuals$duplicated` is equal to TRUE
# and give me all columns
duo1 = individuals[individuals$duplicated == TRUE, ]
# assign to a new duo1 dataframe, the contents of dataframe "individuals"
# SUCH THAT
# the rows whose value in column `individuals$duplicated` is equal to TRUE
# and give me all columns
duo2 = individuals[individuals$duplicated == FALSE, ]
Alright, now we have two dataframes, each containing one member of a duo. Time to finally merge them back together! We can do this using the merge function we learned about previously. Let’s try an inner merge.
# assign to a new reunion dataframe
# the result of a inner merge between dataframes `duo1` and `duo2`
# using column "hometown" as the key
reunion = merge(duo1, duo2, by = "hometown")
# show results
reunion
## hometown name.x color.x alter.x duplicated.x
## 1 a desert planet Vash the Stampede Red Nicholas D. Wolfwood TRUE
## 2 castle town Link Green <NA> TRUE
## 3 coolsville Scooby Doo Orange Shaggy TRUE
## 4 corellia Han Solo Blue <NA> TRUE
## 5 eville Mandy Red <NA> TRUE
## 6 gotham Batman Red Robin TRUE
## 7 hogwarts Hermione Granger Yellow <NA> TRUE
## 8 koriko Kiki Red <NA> TRUE
## 9 london Sherlock Blue <NA> TRUE
## 10 london (13) Sherlock Blue Watson TRUE
## 11 republic city Aang Orange Katara TRUE
## 12 santa rosa Konrad Franco Red Konstanze Conradt TRUE
## 13 stars hollow Rory Gilmore Yellow <NA> TRUE
## name.y color.y alter.y duplicated.y
## 1 Nicholas D. Wolfwood Blue Vash the Stampede FALSE
## 2 Zelda Violet <NA> FALSE
## 3 Shaggy Green Scooby Doo FALSE
## 4 Chewbacca Orange <NA> FALSE
## 5 Billy Blue <NA> FALSE
## 6 Robin Yellow Batman FALSE
## 7 Harry Potter Red Hermione Granger FALSE
## 8 JiJi Indigo <NA> FALSE
## 9 Watson Indigo <NA> FALSE
## 10 Watson Indigo Sherlock FALSE
## 11 Katara Blue Aang FALSE
## 12 Konstanze Conradt Orange Konrad Franco FALSE
## 13 Lorelai Gilmore Orange Rory Gilmore FALSE
A bit messy still, but we did it! Each duo has been reunited in one row. Now we can clean up the results a bit, and arrive at our desired outcome. You’ll notice some column names now have a “.x” or “.y” at the end. This is because in the original duo dataframes, the columns had the same name. Since two columns can’t have the same name once we join them in a single dataframe, the merge function add these suffixes to denote which column came from which dataframe. The columns from the first dataframe in the merge function always gets the “.x” ending, while the second gets the “.y” ending.
# first let's keep only the columns we want
# assign to new dataframe `outcome` the contents of `reunion`
# SUCH THAT
# all rows
# and the columns are named
outcome = reunion[ , c("name.x", "color.x", "name.y", "color.y", "hometown")]
Lastly, we can rename our columns.
# rename columns of dataframe `outcome`
colnames(outcome) = c("mem_1", "mem_1_color", "mem_2", "mem_2_color", "hometown")
# show results
head(outcome, 10)
## mem_1 mem_1_color mem_2 mem_2_color
## 1 Vash the Stampede Red Nicholas D. Wolfwood Blue
## 2 Link Green Zelda Violet
## 3 Scooby Doo Orange Shaggy Green
## 4 Han Solo Blue Chewbacca Orange
## 5 Mandy Red Billy Blue
## 6 Batman Red Robin Yellow
## 7 Hermione Granger Yellow Harry Potter Red
## 8 Kiki Red JiJi Indigo
## 9 Sherlock Blue Watson Indigo
## 10 Sherlock Blue Watson Indigo
## hometown
## 1 a desert planet
## 2 castle town
## 3 coolsville
## 4 corellia
## 5 eville
## 6 gotham
## 7 hogwarts
## 8 koriko
## 9 london
## 10 london (13)
Checkpoint
That was hopefully a fun spreadsheet-based adventure, and I’m glad it had a happy ending. But how can we use what we’ve learned in the real world? The inspiration for this lesson came from a task I needed to accomplish for a student organization I helped run, called Hack 4 California. The lab would meet weekly to work on civic data projects for our community, from small projects like helping the campus food pantry with usage statistics, and to full-fledged examinations of government data regarding exposure to hazardous materials in prisons with collaborators at UCLA.
To continue learning from the experience, my co-lead and I wanted to calculate attendance, and reach out to everyone who had participated to ask them what they thought! However, we had an issue. Because of the pandemic, half of our attendance data was in PDFs of excel sheets, and the other half was in zoom attendance logs. In total we had two PDFs, and 28 Zoom logs. You can see versions of these files with fake names below.
PDF Attendance Data
Zoom Log Attendance Data
The data formats are not ideal (Why PDFs of an excel sheet?), but such is the case with most projects. You can import versions of these dataframes to work with using the following code:
source("https://raw.githubusercontent.com/Epsian/table_join_lecture/main/src/3_real_examples.R")
We wanted to create a single dataframe that contained complete attendance data, so that we could see how many times each person attended. Using the skills you’ve learned in this lecture and the course so far, create a single dataframe which contains a row for each person, and columns containing their name, a count of how many times they attended, and their email if available.
Tips